Abstract

Recovering shadows is an important step for many vision 
algorithms. Current approaches that work with time-lapse 
sequences are limited to simple thresholding heuristics. We 
show these approaches only work with very careful tuning 
of parameters, and do not work well for long-term time- 
lapse sequences taken over the span of many months. We 
introduce a parameter-free expectation maximization ap- 
proach which simultaneously estimates shadows, albedo, 
surface normals, and skylight. This approach is more ac- 
curate than previous methods, works over both very short 
and very long sequences, and is robust to the effects of non- 
linear camera response. Finally, we demonstrate that the 
shadow masks derived through this algorithm substantially 
improve the performance of sun-based photometric stereo 
compared to earlier shadow mask estimation. 



1. Introduction 

Shadows are a critical component of image formation. 
They are one of the largest causes of appearance change in 
outdoor scenes. Across many problem domains, invariance 
to lighting drives choices in image pre-processing and fea- 
ture selection. 

Recently, several works have aimed to understand how to model
outdoor image formation through time [1, 2, 13], all of which
explicitly model shadows. In each of these works, the authors use
some variant of heuristic thresholding to solve the shadow-or-not
classification problem: given many images of a static scene under
varying illumination, which pixels are directly illuminated at
which times?

Previous methods are heuristics that require parameter tuning and
still often fail for data drawn from weeks or months instead of a
single day. In this work we focus on solving the shadow-or-not
classification problem in real scenes as captured by outdoor
webcams, over a variety of lighting directions and timespans,
without parameter tweaking.

We use an expectation-maximization approach which estimates the
expected intensity of an image under direct sunlight and under
shadow. Our approach explicitly models the sun as a moving light
source and recovers the Lambertian world most consistent with that
lighting. We find that this model more aptly captures the real
world than simple thresholding, works well for very short and very
long sequences, and robustly handles a variety of real-world
distortions such as nonlinear camera response.

Figure 1. Given a set of time-lapse imagery (a), we wish to
classify each pixel at each time as being under shadow or not.
Previous work ((b) and (c)) demonstrates some success with simple
thresholding heuristics, but these fail for oblique lighting
directions and poor selection of tuning parameters. Our approach
(d) is parameter-free, robust to changing lighting conditions, and
outperforms previous methods. Panels: (a) example image from a
time-lapse; (b) algorithm from [1]; (c) algorithm from [13];
(d) our approach.

We offer three novel contributions. First, we introduce a
parameter-free shadow estimation procedure for time-lapse
sequences of outdoor scenes over both very long and very short
time frames. Second, we characterize how our algorithm performs
over varying lengths of time and under the effect of nonlinear
radiometric response, details which are not modeled in our
formulation in order to keep the computation tractable. Finally,
we explore how our algorithm performs on real-world cameras and
introduce the Labeled Shadows in the Wild dataset (7 scenes, each
with 50 images and ground truth shadow labels) to quantitatively
compare shadow classification in real outdoor settings for future
studies. Our code and this dataset will be publicly shared.

2. Related Work 

Our method can be seen as an approach for simultaneous shadow
estimation and photometric stereo [15]. Most active research in
photometric stereo is focused on uncalibrated data sets, where the
light direction and intensity are unknown. In contrast, we use
webcams with known geolocation and timestamp, so we can use a
solar position lookup [12] to recover the lighting direction.

A large body of work solves the photometric stereo problem in the
presence of shadows by treating them as noisy measurements. Wu et
al. [17] treat shadows as large-but-sparse errors in the
Lambertian model and frame photometric stereo as a low-rank matrix
factorization problem. This approach offers robust estimation in
the face of specularities, sparse shadows, and large outliers. Wu
and Tang [18] use an expectation maximization approach to
simultaneously solve the photometric stereo problem and estimate a
pixel-wise weight defining whether each pixel satisfies the
Lambertian model. Chandraker et al. [3] isolate groups of pixels
that are simultaneously lit by a common light source through a
Markov Random Field to support spatial smoothness. Sunkavalli et
al. [14] extend this approach to the uncalibrated case.

These works introduce a variety of parameters: [17] uses a
tradeoff between low-rank and sparse-error recovery, [18]
specifies Gaussian bandwidths and a penalization cost for breaking
the Lambertian model,¹ [3] uses a tradeoff between a data term and
a smoothness term, and [14] uses various thresholding parameters
in a RANSAC setting. In pursuit of a truly automatic method, our
approach contains no parameters whatsoever.

Our approach differs from all of the above works by 
treating shadows not as noise to be detected and ignored, 
but rather by explicitly modeling the shadow process within 
the image formation model. This is an important step for 
automatically-interpreting outdoor imagery, where the con- 
tribution of ambient light is substantial, and shadowed pix- 
els are frequent. This explicit modeling has the benefit of 
removing all parameters from the algorithm. 

Prior experiments report results on very controlled environments,
taken in a dark room with known camera settings. Notable
exceptions include Ackermann et al. [2] and Abrams et al. [1],
which solve the photometric stereo problem for outdoor webcams.
Sunkavalli et al. [13] also present an outdoor image formation
model, but because they work with only a single day of imagery,
they only recover a 1-D projection of each pixel's surface normal.

2.1. Shadow Estimation in Time-Lapse Sequences 

Most previous approaches design a threshold for each 
pixel by analyzing that pixel's intensity trajectory through 
time. This section describes our notation, the prior ap- 
proaches, and their parameters. 

We denote an image taken at time $t$ as a $p$-element vector $I_t$
with values between 0 and 255, where $p$ is the number of non-sky
pixels. The goal is to take a set of $n$ images $I_1, \ldots, I_n$
and recover a sunny-or-not binary classification for each image,
$S_1, \ldots, S_n$. We index a pixel $x$ at time $t$ as $I_t(x)$
or $S_t(x)$. By convention, if a pixel $x$ is directly lit at time
$t$, then $S_t(x) = 1$; otherwise $S_t(x) = 0$.

Factored Time-Lapse Video The approach in [13] observes that over
the span of a day, most pixels are under shadow at least 20% of
the time. This leads them to a heuristic that finds the median of
the shadowed intensity (the 10th percentile pixel), then chooses a
threshold at 1.5 times that value. This is the approach used in
[2] for outdoor photometric stereo. We later explore how various
settings of these parameters affect results on the shadow-or-not
classification task, so we generalize their approach to handle
arbitrary scalar multipliers $\theta_k$ and percentiles
$\theta_p$:

$$S_t(x) \leftarrow \begin{cases} 1 & I_t(x) > \theta_k \, \mathrm{per}(I(x), \theta_p) \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

where $\mathrm{per}(A, \theta_p)$ returns the bottom
$\theta_p$-th percentile value of the set $A$ of grayscale
intensities, and $I(x)$ is shorthand for $\{I_t(x)\}_{t=1}^{n}$.

1 The authors of [18] remark that most reasonable automated
choices of this cost give approximately the same result,
suggesting that no user-specified parameter is truly necessary.
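As a rough illustration of Equation 1, the sketch below thresholds each pixel's trajectory at a multiple of a low percentile. The function name `ftlv_shadow_labels`, the (images by pixels) array layout, and the use of NumPy's default linearly interpolated percentile are assumptions made for illustration; they are not details taken from [13].

```python
import numpy as np

def ftlv_shadow_labels(images, theta_p=0.2, theta_k=1.5):
    """Label pixels as lit (1) or shadowed (0) following Equation 1.

    images:  array of shape (n, p), one row per image, values in [0, 255].
    theta_p: percentile expressed as a fraction in [0, 1] (assumed convention).
    theta_k: scalar multiplier applied to the percentile value.
    """
    images = np.asarray(images, dtype=float)
    # Per-pixel reference intensity: the theta_p percentile over time.
    reference = np.percentile(images, 100.0 * theta_p, axis=0)
    # A pixel is lit when it is brighter than theta_k times the reference.
    return (images > theta_k * reference).astype(np.uint8)

# Toy trajectory for one pixel: three dark frames, then two bright frames.
trajectory = np.array([[20.0], [22.0], [21.0], [90.0], [95.0]])
labels = ftlv_shadow_labels(trajectory)
```

Because the threshold is fixed per pixel for the whole sequence, this sketch shares the limitation of the heuristic itself: it cannot adapt when the brightness of shadowed pixels drifts across seasons.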

Heliometric Stereo The approach presented in [1] also 
works by simple thresholding, but allows the threshold to 
adaptively change from frame to frame (assuming the im- 
ages are listed in chronological order). This adaptive ap- 
proach attempts to model the changing light intensity across 
seasons; this is important for long-term time-lapses because 
a shadowed pixel in the summer — where the sun is highest 
in the sky — might be brighter than a lit pixel in the winter, 
so often a single threshold does not work. 

For each pixel $x$, their approach defines two centroids: the
expected intensity of that pixel when it is directly lit and when
it is under shadow, denoted $E_L$ and $E_S$ respectively. The
centroids $E_L$ and $E_S$ are initially set by taking the top and
bottom $\theta_p$ percentiles of the image sequence. For each
image from $t = 1 \to n$, if the difference from $E_L(x, t-1)$ to
$I_t(x)$ is smaller than the difference from $E_S(x, t-1)$ to
$I_t(x)$, then update

$$E_L(x, t) \leftarrow E_L(x, t-1)\,\theta_\lambda + I_t(x)(1 - \theta_\lambda). \tag{2}$$

Otherwise, update

$$E_S(x, t) \leftarrow E_S(x, t-1)\,\theta_\lambda + I_t(x)(1 - \theta_\lambda), \tag{3}$$

where $\theta_\lambda \in [0, 1]$ is a parameter that defines how
quickly these centroids can change. Finally, this
centroid-updating step is reversed, from $t = n \to 1$, to lessen
the effect of initialization on centroids close to $t = 1$. The
final shadow labeling is determined by whether the original image
fits the expectation of a shaded or directly-lit pixel:

$$S_t(x) \leftarrow \begin{cases} 1 & |I_t(x) - E_L(x, t)| < |I_t(x) - E_S(x, t)| \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

In summary, at the span of one or a few days, the approach in [13]
performs well, and over the span of a few months, the more
complicated heuristic in [1] does somewhat better at capturing
shadows over long time periods, but at the cost of an additional
parameter $\theta_\lambda$. We show in Section 4 that more formal
modeling of the image formation process gives improved results
over even the optimal parameter settings.
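The adaptive centroid update of Equations 2 to 4 can be sketched for a single pixel as follows. Several details here are assumptions rather than facts from [1]: the function name `hs_shadow_labels`, initializing the lit centroid at the $\theta_p$ percentile and the shadow centroid at the $(1 - \theta_p)$ percentile, and using the centroids from the backward pass for the final labeling.

```python
import numpy as np

def hs_shadow_labels(intensity, theta_p=0.8, theta_lambda=0.05):
    """Adaptive two-centroid labeling of one pixel's intensity trajectory."""
    intensity = np.asarray(intensity, dtype=float)
    n = intensity.shape[0]
    # Assumed initialization: top and bottom percentiles of the trajectory.
    e_lit = float(np.percentile(intensity, 100.0 * theta_p))
    e_shadow = float(np.percentile(intensity, 100.0 * (1.0 - theta_p)))
    lit_track = np.empty(n)
    shadow_track = np.empty(n)
    # Forward pass (t = 1..n) followed by a backward pass (t = n..1).
    for order in (range(n), range(n - 1, -1, -1)):
        for t in order:
            value = intensity[t]
            if abs(value - e_lit) < abs(value - e_shadow):
                # Equation 2: move the lit centroid toward the observation.
                e_lit = e_lit * theta_lambda + value * (1.0 - theta_lambda)
            else:
                # Equation 3: move the shadow centroid instead.
                e_shadow = e_shadow * theta_lambda + value * (1.0 - theta_lambda)
            lit_track[t] = e_lit
            shadow_track[t] = e_shadow
    # Equation 4: assign each frame to the nearer centroid.
    nearer_lit = np.abs(intensity - lit_track) < np.abs(intensity - shadow_track)
    return nearer_lit.astype(np.uint8)

# Alternating shadowed/lit frames whose brightness slowly increases.
hs_labels = hs_shadow_labels([20, 60, 22, 64, 30, 80, 40, 100])
```

With a small `theta_lambda`, each centroid closely follows the most recent observation assigned to it, which is what lets the threshold drift over the sequence.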

3. Parameter-Free Shadow Estimation 

In all cases, previous shadow estimation procedures do not attempt
to model changing lighting direction. We argue that this is an
unnecessary restriction, since sun position algorithms [12] very
accurately estimate the sun position given an image's capture time
and geolocation, which itself can be determined automatically by a
variety of methods.

Therefore, we assume that a camera has been geolocated and
accurately timestamped to recover per-image lighting directions
$L_1, \ldots, L_t, \ldots, L_n$ as three-dimensional unit vectors.
Given this information, we develop an expectation maximization
approach which solves for the shadows most consistent with a
Lambertian assumption. Borrowing from the image formation models
of [1, 2], we use a simple Lambertian model to represent our
scene:

$$I_t(x) \approx \rho(x)\left(\max(L_t \cdot N(x), 0)\, S_t(x) + A(x)\right) \tag{5}$$

where $\rho(x)$, $N(x)$, and $A(x)$ are the albedo, normal vector,
and skylight (ambient light contribution) of a pixel $x$,
respectively. To handle color, we represent $\rho(x)$ as an RGB
vector, while the skylight remains grayscale. Therefore, our goal
is to estimate the unknown albedo, surface normal, ambient light,
and a shadow labeling for each pixel, given the original imagery
and associated lighting directions.
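To make the image formation model of Equation 5 concrete, the following sketch evaluates it for one grayscale pixel over a short sequence. The function name `render_intensity` and the numeric values are illustrative assumptions.

```python
import numpy as np

def render_intensity(albedo, normal, skylight, light_dirs, lit):
    """Evaluate Equation 5 for one grayscale pixel over n images.

    light_dirs: (n, 3) unit vectors L_t; lit: (n,) binary labels S_t.
    """
    light_dirs = np.asarray(light_dirs, dtype=float)
    lit = np.asarray(lit, dtype=float)
    # Hinged Lambertian shading max(L_t . N, 0) for every frame.
    shading = np.maximum(light_dirs @ np.asarray(normal, dtype=float), 0.0)
    # Direct sunlight contributes only when the pixel is lit; skylight always does.
    return albedo * (shading * lit + skylight)

intensities = render_intensity(
    albedo=100.0,
    normal=[0.0, 0.0, 1.0],
    skylight=0.3,
    light_dirs=[[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]],
    lit=[1, 0],
)
```

In this toy case the first frame is directly lit (shading plus skylight) and the second is shadowed, so only the skylight term contributes to it.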



Compared to [1, 2], we do not include any time-varying unknowns
(such as light intensity, exposure, or ambient intensity) or the
camera's unknown radiometric response, and we do not attempt to
handle non-Lambertian surfaces. Although it would be possible to
extend this model to handle these unknowns, using a simpler model
results in a simpler algorithm which already outperforms current
state-of-the-art shadow estimation approaches. This simpler model
is efficient enough to be used as pre-processing for optimization
over a more complete image formation model.

Our approach alternates between fitting the per-pixel parameters
$\rho(x)$, $N(x)$, and $A(x)$, and updating the shadow volume
$S_t(x)$.

3.1. Expectation Step 

In the expectation step, we aim to find the expected albedo,
normals, and skylight, given the shadow volume. Performing this
expectation over RGB images gives a nonlinear problem, because the
normal vector must be constrained to unit length. In the case of
grayscale images, a change in notation expresses the Lambertian
model as an independent system of linear equations for each pixel
$x$. Writing $\rho$ as the grayscale albedo,
$[a(x), b(x), c(x)]^\top = \rho(x) N(x)$ and
$d(x) = \rho(x) A(x)$, we solve the following linear system:

$$\begin{bmatrix} S_1(x) L_1^\top & 1 \\ \vdots & \vdots \\ S_n(x) L_n^\top & 1 \end{bmatrix} \begin{bmatrix} a(x) \\ b(x) \\ c(x) \\ d(x) \end{bmatrix} = \begin{bmatrix} I_1(x) \\ \vdots \\ I_n(x) \end{bmatrix} \tag{6}$$

This $n \times 4$ system of equations therefore solves for the
Lambertian model with a skylight term (i.e. surface normal,
skylight, and grayscale albedo) most consistent with the data, for
a single pixel $x$. After solving for auxiliary variables $a$,
$b$, $c$, and $d$, we recover the albedo, normal, and skylight as:

$$\rho(x) = \sqrt{a(x)^2 + b(x)^2 + c(x)^2}, \tag{7}$$

$$N(x) = \frac{[a(x), b(x), c(x)]^\top}{\rho(x)}, \qquad A(x) = \frac{d(x)}{\rho(x)}. \tag{8}$$
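The per-pixel fit can be sketched with an ordinary least-squares solve. This is a minimal illustration that assumes the design matrix has rows $[S_t(x) L_t^\top, 1]$, as implied by the definitions of $a$, $b$, $c$, and $d$; the function name `fit_pixel` and the synthetic values are hypothetical.

```python
import numpy as np

def fit_pixel(intensity, light_dirs, lit):
    """Least-squares fit of albedo, normal and skylight for one pixel."""
    intensity = np.asarray(intensity, dtype=float)
    light_dirs = np.asarray(light_dirs, dtype=float)
    lit = np.asarray(lit, dtype=float)
    # Each row is [S_t * L_t^T, 1]; the unknowns are (a, b, c, d).
    design = np.hstack([lit[:, None] * light_dirs, np.ones((len(lit), 1))])
    solution, _, rank, _ = np.linalg.lstsq(design, intensity, rcond=None)
    scaled_normal, d = solution[:3], solution[3]
    albedo = np.linalg.norm(scaled_normal)   # Equation 7
    normal = scaled_normal / albedo          # Equation 8; undefined if albedo is 0
    skylight = d / albedo                    # Equation 8
    return albedo, normal, skylight, rank

# Synthetic pixel: albedo 120, skylight 0.25, four lit frames and one shadowed.
true_normal = np.array([0.0, 0.6, 0.8])
light_dirs = np.array([[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.8, 0.6],
                       [-0.6, 0.0, 0.8], [0.0, -0.6, 0.8]])
lit = np.array([1, 1, 1, 0, 1])
observed = 120.0 * (lit * (light_dirs @ true_normal) + 0.25)
albedo, normal, skylight, rank = fit_pixel(observed, light_dirs, lit)
```

The returned `rank` makes it easy to detect the degenerate case in which too few frames are labeled as lit for the normal to be recoverable.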



To handle color images, we first run the algorithm on grayscale
images to recover grayscale normals and skylight, and solve for
the best color albedo through a closed-form solution:

$$\rho_C(x) = \frac{1}{n} \sum_{t=1}^{n} \frac{I_t^C(x)}{\max(L_t \cdot N(x), 0)\, S_t(x) + A(x)}, \tag{9}$$

where $C \in \{R, G, B\}$ is the color channel of the albedo or
image.

Notice that only here do we strictly enforce non-negative
Lambertian lighting. Ideally, we would also want to solve for the
surface normal in Equation 6 that satisfies this constraint.
Further, since the hinge function $\max(x, 0)$ is convex,
enforcing such a constraint would yield a globally-optimal
solution. However, solving with respect to the hinge loss
increases runtime dramatically, and in practice, the shadow volume
$S_t(x)$ quickly stabilizes to cover all times when the pixel is
under shadow, including attached shadows when
$L_t \cdot N(x) < 0$, thus zeroing out the Lambertian term without
use of the hinge function.

Figure 2. Experiments with synthetic data. We use a rendering
pipeline to create 300 images using a year's worth of simulated
lighting directions (a). Our recovered results from this sequence
match the ground truth results almost exactly. Panels: (a) example
images; (b) ground truth $S$; (c) ground truth $\rho$; (d) ground
truth $N$; (e) recovered $S$; (f) recovered $\rho$; (g) recovered
$N$.

3.2. Maximization Step 

In the maximization step, we aim to find the maximum-likelihood
classification of a pixel $x$ at time $t$ as being in shadow or
not, given the current estimates of albedo, normal, and skylight.
In our case, we simply evaluate the quality of the reconstruction
in each case, and choose the best:

$$r_1 \leftarrow \left\| I_t(x) - \rho(x)\left(\max(L_t \cdot N(x), 0) + A(x)\right) \right\|^2 \tag{10}$$

$$r_0 \leftarrow \left\| I_t(x) - \rho(x) A(x) \right\|^2 \tag{11}$$

$$S_t(x) \leftarrow \begin{cases} 1 & r_1 < r_0 \\ 0 & \text{otherwise} \end{cases} \tag{12}$$
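The labeling rule of Equations 10 to 12 can be sketched for one grayscale pixel as follows. The function name `update_labels` and the sample values are assumptions for illustration, as is the strict inequality, which sends ties to the shadowed label.

```python
import numpy as np

def update_labels(intensity, light_dirs, albedo, normal, skylight):
    """Pick, per frame, whichever of the lit/shadowed predictions fits better."""
    intensity = np.asarray(intensity, dtype=float)
    light_dirs = np.asarray(light_dirs, dtype=float)
    shading = np.maximum(light_dirs @ np.asarray(normal, dtype=float), 0.0)
    # Equation 10: squared residual assuming the pixel is directly lit.
    r_lit = (intensity - albedo * (shading + skylight)) ** 2
    # Equation 11: squared residual assuming the pixel is in shadow.
    r_shadow = (intensity - albedo * skylight) ** 2
    # Equation 12: lit only when the lit model is strictly better.
    return (r_lit < r_shadow).astype(np.uint8)

labels = update_labels(
    intensity=[128.0, 33.0, 30.0],
    light_dirs=[[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.8, 0.0, -0.6]],
    albedo=100.0,
    normal=[0.0, 0.0, 1.0],
    skylight=0.3,
)
```

In the third frame the sun is behind the surface, so both predictions coincide and the frame is labeled as shadowed; this mirrors how attached shadows end up inside the shadow volume.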



3.3. Implementation Details 

We repeatedly alternate between the expectation and maximization
steps until the $S_t(x)$ labels do not change from iteration to
iteration, or until 50 iterations have passed. In practice, most
pixels' labels converge quickly: in all experiments, more than 50%
of the pixels reach convergence before 6 iterations, and 99% reach
convergence before 20 iterations.

In practice, the linear system in Equation 6 can quickly become
rank-deficient; for example, if one sets $S_t(x) = 0$ for all $t$,
the system over four variables reduces to rank one. Intuitively,
this makes sense, since recovery of surface normals is numerically
impossible for a pixel consistently under shadow. When the
assignment of $S_t(x)$ yields a singular matrix in Equation 6, we
tried many methods of resetting $S_t(x)$ to regain full rank,
including a full reset to $S_t(x) = 1$ (i.e., pixel $x$ is
directly lit all the time) or $S_t(x) = 0$ (pixel $x$ always
shaded, effectively giving up on estimating albedo and normals for
this pixel). However, we found that the most accurate results came
from an incremental re-assignment, which chooses the time $t$ at
which the pixel $x$ is brightest yet shadowed and reassigns
$S_t(x) = 1$, repeating until the resulting linear system is full
rank. This reassignment is done at each iteration before
performing the expectation step.²
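The incremental re-assignment can be sketched as a small loop over one pixel's labels. This is an illustrative reading of the procedure, assuming the design matrix has rows $[S_t(x) L_t^\top, 1]$; the function name `repair_rank` and the early exit when no shadowed frames remain are assumptions.

```python
import numpy as np

def repair_rank(intensity, light_dirs, lit):
    """Flip the brightest shadowed frames to lit until the system has rank 4."""
    intensity = np.asarray(intensity, dtype=float)
    light_dirs = np.asarray(light_dirs, dtype=float)
    lit = np.array(lit, dtype=np.uint8)  # work on a copy of the labels
    n = len(lit)
    while True:
        design = np.hstack([lit[:, None] * light_dirs, np.ones((n, 1))])
        full_rank = np.linalg.matrix_rank(design) == 4
        if full_rank or not np.any(lit == 0):
            return lit
        # Choose the brightest frame that is still labeled as shadowed.
        shadowed = np.flatnonzero(lit == 0)
        brightest = shadowed[np.argmax(intensity[shadowed])]
        lit[brightest] = 1

light_dirs = np.array([[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.8, 0.6],
                       [-0.6, 0.0, 0.8], [0.0, -0.6, 0.8]])
intensity = np.array([130.0, 94.0, 126.0, 30.0, 58.0])
repaired = repair_rank(intensity, light_dirs, lit=[0, 0, 0, 0, 0])
```

Starting from an all-shadow labeling (rank one), the loop promotes the three brightest frames, after which the four unknowns are determined.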

We additionally experimented with many initialization procedures,
including assuming all pixels are always directly lit, and random
initialization. The best method we found was to initialize
$S_t(x) = 1$, except for the $t$ where $I_t(x)$ is minimal, for
which we use $S_t(x) = 0$.

In real sequences, pixels become saturated at $I_t(x) = 255$,
breaking the color linearity assumption of Equation 5. In this
work, we replace any saturated $I_t(x)$ with the expected
intensity consistent with the color linearity assumption; see the
supplemental material for details.

Runtime is largely dependent on the number of non-sky pixels and
the length of the image sequence. To give a rough estimate,
running this algorithm on 100 images at 512x380 resolution
(optimizing over 132,000 non-sky pixels) takes 2 minutes and 33
seconds on a 2.66 GHz Intel Xeon processor with 12 GB of memory
across 8 cores. Since the EM optimization for each pixel is
independent of every other pixel, the per-pixel optimizations are
performed in parallel.

4. Results 

Here, we describe the results of our algorithm in controlled and
uncontrolled environments, and show how this shadow estimation
procedure fits in with the larger field of outdoor photometric
image formation.



2 After a pixel has been reassigned and gone through one iteration of 
expectation and maximization steps, if its resulting labeling is equal to 
the labeling before reassignment, we declare that pixel as converged and 
accept that labeling. Therefore, a pixel under shadow at all times will be 
correctly labeled, albeit with a rank deficiency. 
Figure 3. Sensitivity experiments of the proposed approach on various synthetic datasets. In (a), we use 300 images from a simulated
year-long period and distort the input sequence by a variety of nonlinear camera response functions from [4]. The color of each curve is
the accuracy of the shadow classification task after distorting the sequence by that response function. The plots are split into three separate
plots using the same colormap for easier visualization. In (b), we simultaneously vary the number of images and length of time used in the
sequence and report accuracy of the shadow-or-not classification task.

4.1. Controlled Environments

To test our method, we created a synthetic dataset using the image
formation model of Section 3, simulating lighting directions from
a virtual camera.³ Using a rendering pipeline to create 300 images
over the span of a simulated year, we ran our algorithm and
recovered almost exactly the solution that generated the data; see
Figure 2. We recover the correct shadows with 99.79% accuracy,
with most pixels never making a single mistake. Although the goal
is to recover shadows, we also recover the correct albedo and
normals to within 0.29 intensity levels (on a 0-255 scale) and
0.20°, respectively.

Real webcam images usually suffer from nonlinear camera response,
but our image formation model does not account for such
distortions. Including unknown response in our model would make
the optimization more complex, as a small change in the response
affects all pixels in the time series. To test robustness against
the unknown camera response, we distorted our data by all of the
functions from the Database of Response Functions from Grossberg
and Nayar [4] and re-ran our shadow estimation algorithm. As shown
in Figure 3(a), nonlinear camera response has a negligible effect
on shadow estimation: applying a response function to synthetic
data usually decreases accuracy by less than 1%, and at worst,
less than 5%.

As noted in [1], accurate recovery of surface normals from a
limited data set is challenging. Therefore, we perform experiments
to see how long or short a time period is required to recover
accurate shadow volumes. We take a sequence and run the proposed
approach on different numbers of images $n$, as well as different
lengths of the original sequence (e.g. an hour, a day, a week). We
repeat the experiment for different random subsets of imagery and
report average accuracy over 10 trials. Figure 3(b) demonstrates
that while having a longer clock time improves the result, the
number of images used is much more important: as long as the
original sequence has at least 25 images across a few hours, we
reliably recover the correct shadow volume. For all results
presented in this paper, we use 50-100 images taken over the span
of many months.

3 We originally considered using the available ground truth
synthetic data from [6], but their ground truth labels only
include cast shadows, never attached shadows, and our algorithm
does not make such a distinction.

These results contradict [1], which states that explicit
radiometric modeling and input imagery spanning many 
weeks is necessary to recover good surface normals. We 
attribute this property to the difference in appearance for 
shaded vs. directly-lit pixels: although we may not know 
what surface normal or radiometric curve describes the in- 
tensity of a pixel under direct sunlight, the intensity dif- 
ference between a pixel's expected intensity in and out of 
shadow is substantial enough to accurately disambiguate the 
two (given a lighting direction). 

4.2. Uncontrolled Environment 

The experiments in the previous section describe how 
our algorithm performs in synthetic environments. The real 
world, however, has many error modes that cannot easily 
be exhaustively enumerated in such an environment. Given 
that our goal is to recover shadows from real time-lapse 
sequences, we report performance of shadow classification 
with respect to real scenes, taken from the Archive of Many 
Outdoor Scenes (AMOS) webcam dataset [9].



4.2.1 The Labeled Shadows in the Wild Dataset 

To report quantitative measurements, we selected 7 scenes from the
AMOS dataset and labeled ground truth shadow masks for 50 images
each. Our labeled data comes from a variety of cameras across the
globe, including a busy plaza in Barcelona, a university in
Arizona, and a camera in Germany that observes complicated
geometry. These cameras break many of the assumptions that we make
in our shadow estimation procedure, through the inclusion of
atmospheric haze, time-variable geometry (most notably in the
plaza, where pedestrians and flea markets occupy large parts of
the image), variable exposure, and nonlinear camera response.

Figure 4. Results from various shadow estimation approaches, with imagery taken from the AMOS dataset [9]. From top to bottom:
an original image, the Factored Time Lapse Video approach [13], the Heliometric Stereo approach [1], and our results. Each shadow
estimation approach shows the estimated shadow mask for the given original frame. Although the approach from [1] does better, it makes
many subtle errors (circled in pink). Our approach makes fewer mistakes than the other two methods, and accurately recovers both large-scale
and fine detail in shadow patterns. Although our goal is to recover shadows, our EM algorithm simultaneously recovers normals and
albedos as a byproduct (last two rows). We show normals in an East-North-Up coordinate frame, using the color map from [1], where the
hue codes for geographic orientation and the lightness represents degrees away from the zenith.

Labeling such a dataset is itself nontrivial. In selecting images
from each camera, we use the automatic image selection algorithm
of [2] to avoid human bias and select clear-day images with
easily-discriminable shadow boundaries. The scenes we use
demonstrate considerable complexity in scene geometry and noise
factors, and hand-labeling millions of pixels is challenging for
even the most experienced graduate student. We therefore take
advantage of the natural color distribution of shadows described
in [5] and label only a sparse sample of pixels for which we are
confident. These sparse labels are then propagated using the
matting equation of [11], creating an $\alpha$-matte across the
image. We label any pixel with $\alpha > 0.7$ as directly lit, and
any pixel with $\alpha < 0.3$ as shadowed. All other pixels are
left unlabeled. All labels were hand-verified before
experimentation. See the supplemental material for example labels.
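The thresholding of the $\alpha$-matte into three label classes can be sketched as below. The label encoding, the function names, and in particular the choice to score accuracy only over labeled pixels are assumptions for illustration; the text does not spell out how unlabeled pixels enter the evaluation.

```python
import numpy as np

LIT, SHADOW, UNLABELED = 1, 0, -1  # hypothetical label encoding

def labels_from_alpha(alpha, lit_above=0.7, shadow_below=0.3):
    """Convert a propagated alpha matte into lit / shadowed / unlabeled pixels."""
    alpha = np.asarray(alpha, dtype=float)
    labels = np.full(alpha.shape, UNLABELED, dtype=np.int8)
    labels[alpha > lit_above] = LIT
    labels[alpha < shadow_below] = SHADOW
    return labels

def masked_accuracy(predicted, ground_truth):
    """Fraction of correct predictions, ignoring unlabeled pixels (assumed)."""
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    labeled = ground_truth != UNLABELED
    return float(np.mean(predicted[labeled] == ground_truth[labeled]))

truth = labels_from_alpha([0.9, 0.5, 0.1, 0.8])
accuracy = masked_accuracy([1, 1, 1, 1], truth)
```

Here the pixel with $\alpha = 0.5$ stays unlabeled, so the all-lit prediction is scored on the remaining three pixels only.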

To facilitate future comparison in time-lapse shadow estimation,
the Labeled Shadows in the Wild (LSW) dataset and our code are
available at [anonymized].

4.2.2 Evaluation 

To give the best possible performance of alternative exist- 
ing algorithms, we use the LSW data for cross-validation 
and then test on the same dataset. We choose parameters 
with three strategies. The "Suggested" strategy uses the 
parameters prescribed in their respective papers. For the 
"Global" strategy, we select a single set of parameters that 
maximizes each algorithm's average accuracy across the 7 
scenes. For the "Optimal" strategy, we tweak parameters 
for each scene individually to maximize accuracy. As our
approach is parameter-free, no cross-validation is necessary.

Interestingly, the parameters suggested by [1, 13], originally set
empirically, are not quite the same as those recovered in the
cross-validation step. Although the FTLV approach has suggested
parameters $(\theta_p, \theta_k) = (0.2, 1.5)$, we found that for
this dataset, the best parameters were $(0.05, 1.5)$. The HS
approach suggests $(\theta_p, \theta_\lambda) = (0.8, 0.05)$,
whereas the optimal values we use are $(0.98, 0.01)$.

Our quantitative results are shown in Table 1. These show that our
approach and [1] perform roughly equally when parameters have been
chosen to maximize accuracy per dataset. However, we stress that
these numbers represent the best-case performance of the other
methods, and that using any other parameter setting deteriorates
their performance, sometimes dramatically (in the case of the
default suggested parameters).

A qualitative comparison is shown in Figure 4. We select 100
images from several cameras from the AMOS dataset [9], again using
the image selection algorithm of [2] and a multi-scale alignment
procedure from [7], and perform each of the shadow estimation
procedures. Because the FTLV algorithm is designed for a single
day's worth of imagery, the resulting shadow masks are
understandably unusable. The HS approach does better, but often
shades too much of the scene (most visibly in columns 1 and 3 of
Figure 4).

Algorithm       Suggested   Global   Optimal
FTLV [13]         74.22      74.94    82.03
HS [1]            84.58      86.76    87.40
Our approach      87.13      87.13    87.13

Table 1. Accuracy of various approaches (higher is better) on the
Labeled Shadows in the Wild dataset, in percent. In the
"Suggested" column, we use the parameters as reported in their
respective papers. In "Global", we set the parameters of the first
two approaches by treating the test dataset as a cross-validation
set to maximize accuracy across all scenes. Finally, in "Optimal",
we optimize a set of parameters separately for each scene. Our
approach is parameter-free and does not require any such
cross-validation.



4.3. Initialization for Time-Lapse Photometric Analysis



Estimating the shadow volume for a sequence is often a first step
for more in-depth photometric analysis of a time-lapse scene
[1, 2, 13]. In each of these works, shadow estimation is
considered a pre-process during the initialization step. We use
the code from [1], which solves for surface normals and albedo
from a time-lapse scene, but also
simultaneously recovers estimates for per-image exposure 
and radiometric response functions. To test the practicality 
of our routine, we initialize the optimization in two different 
ways: one using their suggested initialization, and another 
using our proposed approach. We then let the optimization 
continue until convergence; the only difference is in the ini- 
tialization routine. 

We compare the resulting normals from each optimization to Google
Earth ground truth models. This comparison, visualized in
Figure 5, shows a few important details.
First, although the errors for both initialization routines ap- 
pear very large, most of these errors are coming from ar- 
eas of the scene not modeled well by Google Earth, and 
the errors from well-modeled surfaces are less than 10 de- 
grees. Further, initializing with the proposed EM algorithm 
yields much more accurate surface normals than previously 
reported: the peak of surface normal error shifts from 20 
degrees to less than 10 degrees. This emphasizes the impor- 
tance of estimating shadows in these larger pipelines. 

5. Conclusions 

In this work, we introduce a method for classifying a pixel as
being directly lit or in shadow in real outdoor time-lapse
sequences. Our expectation-maximization approach is parameter-free
and outperforms previous methods. To validate our algorithm, we
perform synthetic experiments to show that our approach is robust
to nonlinear camera response and is invariant to sequence length.
We also introduce the Labeled Shadows in the Wild dataset, which
offers a standard basis for future work to evaluate shadow
estimation in the face of noisy signals in real outdoor scenes. We
show that our approach improves normal field accuracy when used as
an initialization step for richer image formation model inference.

Figure 5. Comparing errors in normal estimates by initializing the optimization in [1] with their suggested method (a) and the proposed
method (b). While ground truth data from Google Earth (c) does not contain real-world normal variation in details such as tree leaves and
individual windows, it provides a convenient surrogate for estimating normal accuracy for scenes in the wild. Although both approaches
appear to have substantial errors, as noted in [1], these largely come from low-detail polygons in Google Earth models. Initializing the
optimization with our approach, however, substantially decreases angular error (d), (e).

Detecting shadows is a critical piece of any visual system.
Although the thresholding heuristics of previous state-of-the-art
methods work well in some circumstances, optimizing over the
shadow process in the image formation model is an important part
of outdoor time-lapse analysis.

Supplemental Material

A. Saturated Pixels 

As described in the main body of the text, real cameras tend to
have saturated color channels where $I_t(x) = 255$, breaking color
linearity. Note that we can rewrite the image formation model as

$$I_t(x) = c(x)\, a_{x,t}$$

for some color vector $c$ and scalar $a$. Therefore, if the images
fit our model, the unsaturated colors are linear in RGB space.

To handle saturation, we replace any saturated $I_t(x)$ with the
intensity that best fits the color linearity model. We first
estimate the color vector $c(x)$ of each pixel as the mean of
$I(x)$, excluding any times $t$ where any channel of $I_t(x)$ is
saturated. Then, we estimate the per-time scalar $a_t$ as the
solution to the linear system $I_t(x) = a_t\, c(x)$, again
excluding any saturated channels. Finally, we replace any
saturated $I_t(x)$ with $a_t\, c(x)$. If all of the color channels
are saturated (i.e., the pixel is pure white), then we fix that
pixel as being directly lit, and do not attempt to optimize its
label.
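The saturation repair can be sketched for a single pixel as follows. This is an illustrative reading of the steps above: the function name `repair_saturation`, treating values at or above 255 as saturated, and solving for the per-frame scalar by least squares over the unsaturated channels are assumptions. The sketch also assumes the pixel has at least one frame with no saturated channel.

```python
import numpy as np

def repair_saturation(pixel_rgb, saturation=255.0):
    """Replace saturated channels of one pixel using the color linearity model.

    pixel_rgb: (n, 3) RGB observations of a single pixel over time.
    Returns the repaired copy and a mask of frames saturated in every channel.
    """
    pixel_rgb = np.array(pixel_rgb, dtype=float)  # work on a copy
    saturated = pixel_rgb >= saturation
    clean_frames = ~saturated.any(axis=1)
    # Color vector c(x): mean over frames without any saturated channel.
    color = pixel_rgb[clean_frames].mean(axis=0)
    fully_saturated = saturated.all(axis=1)       # kept fixed as directly lit
    for t in np.flatnonzero(saturated.any(axis=1) & ~fully_saturated):
        ok = ~saturated[t]
        # Least-squares scalar a_t from the unsaturated channels only.
        scale = (color[ok] @ pixel_rgb[t, ok]) / (color[ok] @ color[ok])
        pixel_rgb[t, saturated[t]] = scale * color[saturated[t]]
    return pixel_rgb, fully_saturated

observations = [[100, 50, 50], [200, 100, 100], [255, 150, 150], [255, 255, 255]]
repaired, fully_saturated = repair_saturation(observations)
```

In this example the clipped red channel of the third frame is extrapolated along the pixel's color direction, while the pure white fourth frame is left untouched and flagged.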

B. Labeled Shadows in the Wild 

Here, we provide example labels from the Labeled Shad- 
ows in the Wild dataset, described in the main body of the 
text. Each figure contains two images from a single cam- 
era, and shows an example image and its label. On the 
example image, violet borders indicate unknown pixel in- 
tensities, due to timestamps and alignment. On the labeled 
image, black indicates a shadowed pixel, white indicates a 
pixel under direct illumination from the sun, and gray val- 
ues are unknown. 



Figure 6. Labeled results from a webcam in Columbia, Missouri. 




Figure 7. Labeled results from a webcam at a plaza in Barcelona. 




Figure 8. Labeled results from a webcam in Walldürn, Germany.




Figure 9. Labeled results from a webcam in Erfurt, Germany. 




Figure 10. Labeled results from a webcam in Meersburg, Germany. 



